Workflow Provenance Repository

Authors

  • Víctor Cuevas-Vicenttín
  • Parisa Kianmajd
  • Bertram Ludäscher
  • Yaxing Wei
  • David Koop
Abstract

Scientific workflows and their supporting systems are becoming increasingly popular for compute-intensive and data-intensive scientific experiments. The advantages scientific workflows offer include rapid and easy workflow design, software and data reuse, scalable execution, sharing and collaboration, and other benefits that together facilitate “reproducible science”. In this context, provenance, that is, information about the origin, context, derivation, ownership, or history of some artifact, plays a key role, since scientists are interested in examining and auditing the results of scientific experiments. However, performing such analyses on scientific results as part of extended research collaborations requires an adequate environment and tools. Concretely, the need arises for a repository that facilitates sharing scientific workflows and their associated execution traces in an interoperable manner, while also enabling querying and visualization. Furthermore, such functionality should be supported while taking performance and scalability into account. With this purpose in mind we introduce PBase, a scientific workflow provenance repository implementing the proposed PROV-ONE standard, which extends the emerging W3C PROV standard for provenance data with workflow-specific concepts. PBase is built on the Neo4j graph database, thus offering capabilities such as declarative and efficient querying. Our experiences demonstrate the power gained by supporting various types of queries for provenance data. In addition, PBase is equipped with a user-friendly interface tailored to the visualization of scientific workflow provenance data, which also makes specifying queries and interpreting their results easier and more effective.
Víctor Cuevas-Vicenttín 1, Parisa Kianmajd 1, Bertram Ludäscher 1, Paolo Missier 2, Fernando Chirigati 3, Yaxing Wei 4, David Koop 3, Saumen Dey 1

1 University of California at Davis, 2 Newcastle University, 3 Polytechnic Institute of NYU, 4 Oak Ridge National Laboratory

The PBase Scientific Workflow Provenance Repository | IDCC14 | Practice Paper

Introduction

The origin and processing history of an artifact is known as its provenance. Data provenance is an important form of metadata that explains how a particular data product was generated: the system and the steps of the computational process involved, the user responsible for its execution, the time, and the resources used, such as parameter settings, input data, and software tools. Provenance information provides transparency and helps to audit processes and interpret data products. Common uses and applications of provenance include quality control, data curation, debugging, data discovery, and, more generally, the validation, attribution, and reproducibility of scientific results.

State-of-the-art scientific workflow systems such as Kepler (Ludäscher et al., 2006), Taverna (Wolstencroft et al., 2013), and VisTrails (Freire et al., 2006) provide controlled environments for specifying and enacting complex computational pipelines, for which the system automatically captures provenance information in the form of traces (albeit often in proprietary formats). However, they lack the capabilities that would let users effectively query and visualize the provenance traces associated with a particular workflow. In addition, they do not address scientists' need to share their workflow-based computational experiments for the scrutiny and benefit of the rest of the community.
Therefore, the need arises for a repository allowing multiple users to store and query scientific workflow provenance in an interoperable manner. In this work we document the task of developing and putting into use such a repository, called PBase. Our ultimate goal is to incorporate such a repository into DataONE (http://www.dataone.org/), a large-scale, federated data infrastructure serving the Earth sciences community.

In order to build PBase we face three main challenges: i) defining a standard model for workflow provenance representation that is compatible with the main scientific workflow systems; ii) characterizing the required functionality in terms of specific queries; iii) developing an infrastructure that evaluates such queries efficiently while also being easy to use for the scientific community.

We briefly describe next how we addressed those challenges. First, we describe PROV-ONE, our proposed standard for scientific workflow provenance data, after which we present representative provenance queries. Subsequently, we describe the PBase architecture and how users interact with the system. Finally, we discuss related work, present our conclusions, and outline future work.

The PROV-ONE Model for Workflow Provenance

Scientific workflow systems provide a user-friendly graphical interface to specify a computational process in the form of a directed graph of interconnected tasks. Such a graph is an abstraction that can be regarded as prospective provenance, since it details the steps to follow in order to generate the desired result. In addition, as mentioned earlier, the workflow system automatically captures the various events associated with the workflow execution, which is usually referred to as retrospective provenance. The full potential of provenance information is achieved when combining both. To this end, in the context of the DataONE project, an extension of the W3C PROV standard called PROV-ONE has been developed. PROV (PROV-WG, 2013) provides a basic and standardized model for the representation and exchange of provenance information. PROV-ONE uses and extends the information elements of PROV to describe both workflow-level and trace-level information. The adoption of PROV-ONE establishes a common model for PBase and brings the interoperability advantages of the emerging PROV standard. Support for PROV-ONE, serialized as XML, was recently added to VisTrails, and other workflow systems can be supported through corresponding wrappers as well.

Figure 1. PROV-ONE conceptual model UML class diagram.

PROV-ONE Conceptual Model

The PROV-ONE conceptual model is illustrated by the UML diagram of Figure 1. All classes have a corresponding PROV type denoted by a UML stereotype (e.g. «entity»), whereas this is the case for only a subset of the associations (e.g. «used»). The various tasks that form part of a workflow are represented by the Process class. Processes can be either atomic or composite, the latter case being specified through the hasSubProcess self-association. A given Process can be distinguished as a Workflow. Each Process has a series of InputPorts and OutputPorts, and the ports of the various Processes are connected through DataLinks. Note that both input and output ports can be associated with multiple DataLinks. This allows workflow models in which a single output is copied and sent to multiple destinations, as well as models in which tasks take inputs from different sources through a single input port. In order to specify executable instances of a Workflow, default parameters can be defined for some of its constituent Processes. The default parameters are represented by Data elements, which can be of various types (e.g. XML, JSON, files), including collections thereof.
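As a rough illustration of the structural classes just described, the following Python sketch models Process, InputPort, OutputPort, DataLink, and Workflow as plain data classes. It is not the PROV-ONE ontology or the PBase implementation: the attribute names (`input_ports`, `output_ports`, `sub_processes`), the helper `links_from`, and the example process names are assumptions made only for this illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InputPort:
    name: str


@dataclass
class OutputPort:
    name: str


@dataclass
class Process:
    name: str
    input_ports: List[InputPort] = field(default_factory=list)
    output_ports: List[OutputPort] = field(default_factory=list)
    # Stands in for the hasSubProcess self-association (assumed attribute name).
    sub_processes: List["Process"] = field(default_factory=list)

    def is_composite(self) -> bool:
        # A Process is composite when it contains sub-processes, atomic otherwise.
        return bool(self.sub_processes)


@dataclass
class Workflow(Process):
    """A Process that has been distinguished as a Workflow."""


@dataclass
class DataLink:
    source: OutputPort
    target: InputPort


def links_from(port: OutputPort, links: List[DataLink]) -> List[DataLink]:
    """Return the DataLinks leaving this exact port object (hypothetical helper)."""
    return [link for link in links if link.source is port]


# Hypothetical workflow: one output is copied to two destinations.
reader = Process("read_data", output_ports=[OutputPort("data")])
std_dev = Process("std_dev", input_ports=[InputPort("in")])
monthly_avg = Process("monthly_avg", input_ports=[InputPort("in")])
workflow = Workflow("summarize", sub_processes=[reader, std_dev, monthly_avg])

links = [
    DataLink(reader.output_ports[0], std_dev.input_ports[0]),
    DataLink(reader.output_ports[0], monthly_avg.input_ports[0]),
]
fan_out = links_from(reader.output_ports[0], links)
print(len(fan_out))
```

Keeping DataLink as a separate object, rather than storing a single target on each port, is what lets one port take part in several links, which is the fan-out and fan-in property the model calls out.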
In addition, sequential control links can be specified between Processes, as denoted by the SeqCtrlLink class. Such a link enforces that a given Process can only be executed after the other Process that shares the link has sent the required signal. Finally, a particular Process that specifies a Workflow or part of it can be associated with a User that assumes responsibility for its creation. A detailed presentation of the PROV-ONE model based on an OWL-2 ontology can be found in (DataONE-ProvWG, 2014). An example of a concrete workflow and its corresponding trace is presented in the next section.

Scientific Workflow Provenance Queries

PROV-ONE data is intrinsically graph-oriented: workflows are represented as graphs having processes as nodes, and traces are represented in part as graphs whose nodes correspond to data entities (inputs and outputs) and computational activities (process executions). Therefore, the queries of interest over scientific workflow provenance correspond in many cases to queries over graphs, which have been extensively studied and employed in various domains.

Figure 2 a) presents an example workflow whose goal is to summarize ecological spatiotemporal data. The left branch calculates the standard deviation of the regional net ecosystem exchange data obtained from its input, and generates a map with 1 degree resolution illustrating the standard deviation of the data across the region. The right branch uses the same data to calculate monthly averages and presents the results in a plot. A trace generated by running this workflow is depicted in Figure 2 b), with data items as ellipses and process executions as rectangles (some data and process execution nodes have been omitted for clarity).

Figure 2. Example workflow (a) and corresponding trace (b).
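To make the graph view of a trace concrete, here is a minimal Python sketch of a lineage traversal over a toy trace. The trace is hypothetical: it loosely mirrors the two branches described for Figure 2, but the node names and the dictionary layout are assumptions, not data from the paper. The edge names follow the PROV relations used and wasGeneratedBy, which point from effect to cause.

```python
from collections import deque

# Hypothetical trace. Keys and values are made-up identifiers.
WAS_GENERATED_BY = {   # data item -> process execution that produced it
    "plot": "exec_plot",
    "monthly_avg": "exec_average",
    "std_map": "exec_map",
    "std_dev": "exec_stddev",
}
USED = {               # process execution -> data items it consumed
    "exec_plot": ["monthly_avg"],
    "exec_average": ["nee_data"],
    "exec_map": ["std_dev"],
    "exec_stddev": ["nee_data", "resolution"],
}


def lineage(data_item):
    """Return every data item that the given item was derived from."""
    seen = set()
    queue = deque([data_item])
    while queue:
        item = queue.popleft()
        execution = WAS_GENERATED_BY.get(item)
        if execution is None:
            continue  # external input: nothing further upstream
        for source in USED.get(execution, []):
            if source not in seen:
                seen.add(source)
                queue.append(source)
    return seen


def input_datasets(data_item):
    """Upstream items that no execution in the trace generated."""
    return {d for d in lineage(data_item) if d not in WAS_GENERATED_BY}


print(sorted(lineage("plot")))
print(sorted(input_datasets("std_map")))
```

The traversal alternates between the two relations, so a question such as which inputs contributed to a plot reduces to reachability in the trace graph.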
In cooperation with climate scientists who employ scientific workflows such as the one presented in Figure 2 a) in their research, we identified a suite of queries that are relevant for scientific workflow provenance. The queries in the suite can be classified into four main types. i) Lineage queries, i.e., queries dealing with the derivation of data products. For example, “what are the datasets involved in the generation of this plot?”. ii) Execution analysis queries, for example, “find the processes in a workflow that were not completed”. iii) Search queries, for example, “find all the workflows that used a bilinear regrid module”. iv) Statistical queries, such as “list the most used modules across all workflows”.

Table 1 presents six queries that are incorporated into our implementation, together with their query type. Queries 1-3 can also be characterized as focusing on workflows, while queries 4-6 focus on traces. This distinction becomes relevant for user interaction as discussed in the next section.

#  Query                                                     Type
1  Compute the number of invocations of process e7_Regrid    Statistical
2  Find all inputs of a workflow across executions           Lineage
3  Find the modules of processes that were not executed      Execution
4  Find the process executions that were not completed       Execution
5  Find the processes that used data item e10_regrid_method  Search
6  Find the data products influenced by module e7_Regrid     Search

Table 1. Example queries for scientific workflow provenance data.

The PBase Provenance Repository

Provenance Data Storage and Querying

The challenges involved in developing an infrastructure capable of evaluating the aforementioned queries are twofold. First, we need an expressive query language covering diverse types of queries. Second, query evaluation should be efficient, considering that provenance traces can be very large.
We opted to base our infrastructure on the NoSQL graph database Neo4j (http://www.neo4j.org/), following the rationale that it is customized for graph data and that it has recently incorporated Cypher, a declarative graph query language.

1  START n = node:node_auto_index(name="e7_Regrid")
   MATCH m-[:wasAssociatedWith]-n
   RETURN count(m)

2  START n=node(*)
   MATCH (n)<-[:used]-()
   WHERE not ((n)-[:wasGeneratedBy]->()) AND HAS(n.wfID) AND (n.wfID="wf1")
   RETURN DISTINCT n

3  START n=node(*)
   MATCH n<-[:wasAssociatedWith]-m
   WHERE HAS(n.vtType) AND HAS(n.wfID) AND HAS(n.runID) AND n.vtType="vt:module_exec" AND n.completed=-1 AND n.wfID="wf1" AND n.runID="ex1"
   RETURN m

4  START n=node(*)
   WHERE HAS(n.completed) AND n.completed=-1 AND HAS(n.wfID) AND n.wfID="wf1"
   RETURN DISTINCT n

5  START n = node:node_auto_index(name="e10_regrid_method")
   MATCH n<-[:used]-a
   RETURN distinct a

6  START n=node(*)
   MATCH n-[*]->a
   WHERE n-[:wasGeneratedBy]->() AND HAS(n.wfID) AND n.wfID="wf1" AND HAS(a.module) AND a.module="e7_Regrid"
   RETURN DISTINCT n

Table 2. Cypher specification of the queries in Table 1.

Table 2 presents the Cypher specification of the queries presented in the previous section. The basic structure of a Cypher query comprises a START clause that defines a series of starting points for graph exploration, a MATCH clause that specifies a pattern to be matched in the graph, a WHERE clause specifying additional filter conditions, and a RETURN clause that outputs the results, which can consist of complete nodes and edges or only of attributes thereof. A detailed discussion of the language is presented in (Neo Technology, 2014).

In some cases, however, the performance of Neo4j was found to be unsatisfactory when querying large traces. One way to address this issue is to incorporate new indexing and encoding techniques into Neo4j to achieve better performance.
For reachability queries, a particular case of lineage that determines whether there is a directed path between a pair of nodes, we achieved major improvements by incorporating the tree cover encoding introduced in (Agrawal et al., 1989) into the stored workflows and traces. This form of encoding can be generated in time quadratic in the number of nodes, and in the worst case its size can be linear for some nodes. In practice, however, its size is significantly smaller, and it enables reachability queries to be answered in nearly constant time. For a test graph of 10,000 nodes, the encoding enabled us to evaluate 8,000 random queries in under a minute on a laptop, whereas the same evaluation took more than two hours without it.
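The idea behind interval-based tree cover labels can be sketched in a few lines of Python. This is a simplified variant written for illustration, not the PBase code: it picks an arbitrary spanning forest instead of the optimal tree cover of Agrawal et al., it assumes the graph is acyclic, and the function names (`build_tree_cover`, `compact`, `reaches`) are hypothetical. Each node receives a postorder number and a list of intervals; a node reaches another exactly when the target's number falls inside one of the source's intervals.

```python
from collections import deque


def compact(intervals):
    """Merge overlapping or adjacent integer intervals."""
    result = []
    for lo, hi in sorted(intervals):
        if result and lo <= result[-1][1] + 1:
            result[-1] = (result[-1][0], max(result[-1][1], hi))
        else:
            result.append((lo, hi))
    return result


def build_tree_cover(graph):
    """graph: dict mapping node -> list of successors; must be a DAG."""
    nodes = list(dict.fromkeys(
        list(graph) + [m for succs in graph.values() for m in succs]))
    succ = {n: list(graph.get(n, [])) for n in nodes}

    # Topological order (Kahn). The first predecessor to reach a node
    # becomes its parent in the spanning forest (an arbitrary choice).
    indegree = {n: 0 for n in nodes}
    for n in nodes:
        for m in succ[n]:
            indegree[m] += 1
    roots = [n for n in nodes if indegree[n] == 0]
    queue, order, parent = deque(roots), [], {}
    tree_children = {n: [] for n in nodes}
    while queue:
        u = queue.popleft()
        order.append(u)
        for m in succ[u]:
            if m not in parent:
                parent[m] = u
                tree_children[u].append(m)
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")

    # Postorder numbering of the forest; low[n] is the smallest number
    # assigned inside the subtree rooted at n.
    post, low, counter, done = {}, {}, 0, object()
    for root in roots:
        low[root] = counter + 1
        stack = [(root, iter(tree_children[root]))]
        while stack:
            node, children = stack[-1]
            child = next(children, done)
            if child is done:
                stack.pop()
                counter += 1
                post[node] = counter
            else:
                low[child] = counter + 1
                stack.append((child, iter(tree_children[child])))

    # Reverse topological order: inherit the intervals of all successors.
    intervals = {}
    for u in reversed(order):
        merged = [(low[u], post[u])]
        for m in succ[u]:
            merged.extend(intervals[m])
        intervals[u] = compact(merged)
    return post, intervals


def reaches(source, target, post, intervals):
    """True if target is reachable from source (every node reaches itself)."""
    p = post[target]
    return any(lo <= p <= hi for lo, hi in intervals[source])


dag = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"], "d": ["f"], "e": []}
post, intervals = build_tree_cover(dag)
print(reaches("c", "f", post, intervals), reaches("b", "e", post, intervals))
```

A query only scans the source node's interval list, which stays short when most edges are covered by the spanning forest; this is the intuition behind the near-constant query time reported above. How the labels are stored and indexed inside Neo4j is not covered by this sketch.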

Publication date: 2014